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Abstract 



In this paper we introduce deep Gaussian process (GP) models. Deep GPs are a 
deep belief network based on Gaussian process mappings. The data is modeled 
as the output of a multivariate GP. The inputs to that Gaussian process are then 
governed by another GP. A single layer model is equivalent to a standard GP or 
the GP latent variable model (GPLVM). We perform inference in the model by 
approximate variational marginalization. This results in a strict lower bound on 
the marginal likelihood of the model which we use for model selection (number 
of layers and nodes per layer). Deep belief networks are typically applied to rela- 
tively large data sets using stochastic gradient descent for optimization. Our fully 
Bayesian treatment allows for the application of deep models even when data is 
scarce. Model selection by our variational bound shows that a five layer hierarchy 
is justified even when modelling a digit data set containing only 150 examples. 



1 Introduction 

Probabilistic modelling with neural network architectures constitute a well studied area of machine 
learning. The recent advances in the domain of deep learning [Hinton and Osindero, 2006, Bengio 
et al.| |2012| have brought this kind of models again in popularity. Empirically, deep models seem 



to have structural advantages that have been shown to improve learning in complicated datasets 
associated with abstract information [Bengio, 2009]. Many interesting variants of the traditional 
neural networks have emerged since the development of efficient training methodologies for deep 



architectures: deep belief networks, (stacked) restricted Boltzmann machines [Hinton, 2010) etc 



Gaussian process models were introduced in the machine learning community as a fully probabilistic 
substitute for the multilayer perceptron, inspired by the observation |Neal[ [l996] that a GP is a mul- 
tilayer perceptron with infinite basis functions. It seems prudent, therefore, to investigate how these 
models will perform in deep hierarchies. Inference in deep GP models is analytically intractable. 



Lawrence and Moore [2007 ] investigated a maximum a posteriori (MAP) approach for estimation 



of latent variables. For the MAP approach to work, however, a strong prior was required on the top 
level of the hierarchy. There are two main contributions in this paper. Firstly, we show how we can 
approximately marginalize the latent variables variationally. This gives a rigorous lower bound on 
the marginal log likelihood of a deep GP. We test the resulting approach on a toy problem and then 
recreate results of Lawrence and Moore [ 2007 ] on a motion capture example. Our second contri- 
bution is to demonstrate the applicability of deep models even when data is scarce. The variational 
lower bound gives us an objective measure from which we can select different structures for our 
deep hierarchy (number of layers, number of nodes per layer). In a simple digits example we find 
that the best lower bound is given by the model with the deepest hierarchy we applied (5 layers). 

The deep GP consists of a cascade of hidden layers of latent variables where each node acts as output 
for the layer above and as input for the layer below — with the observed outputs being placed in the 
leaves of the hierarchy. Gaussian processes govern the mappings between the layers. 
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A single layer of the deep GP is a Gaussian process latent variabl e model (G PLVM). Infe rence over 
the latent variables even in this model is intractable. However, Titsi as and Lawrence] f20 10] have 
shown that latent variables can be approximately marginalized allowing a variational lower bound 
on the likelihood to be computed. The appropriate size of th e latent space can be compute d using au- 
tomati c relevance determination (ARD) priors |Neal, 1996, Mac Kay et al. 1992]. Dam ianou et al. 



[ 2011] extended this approach by placing a Gaussian process prior over the latent space giving a 
Bayesian dynamical GPLVM. Here we extend that approach to allow us to approximately marginal- 
ize any number of hidden layers. We demonstrate how a deep hierarchy of Gaussian processes can 
be obtained by marginalising out the latent variables in the structure, obtaining a fully Bayesian 
training procedure along with a variational approximation to the true posterior of all the latent vari- 
ables given the outputs. The resulting model is very flexible, and we discuss a number of potential 
applications in further work. 

2 The Model 

We first consider standard approaches to modeling with GPs. We then extend these ideas to deep 
GPs by considering Gaussian process priors over the inputs to the GP model. 

2.1 Standard GP Modelling 

In the traditional probabilistic inference framework, we are given a set of training input-output pairs, 
stored in matrices X G 7Z Nx Q and Y G 7Z NxD respectively, and seek to estimate the unobserved, 
latent f unction / = J(x), responsible f or generating Y given X. In this setting, Gaussian processes 
(GP's) |Rasmussen and W illiams , 2006] can be employed as nonparametric prior distributions over 
the latent function /. More formally, we assume that each datapoint y n is generated from the 
corresponding /(x n ) by adding independent Gaussian noise, i.e. 

yn = /(x n ) + e n , e-A/*(0,cr e I), (1) 

and / is drawn from a Gaussian process, i.e. /(x) ~ QV (0, k{x,x')). This (zero-mean) Gaus- 
sian process prior only depends on the covariance function k operating on the inputs X. As 
we wish to obtain a flexible model, we only make very general assumptions about the form of 
the generative mapping / and this is reflected in the choice of the covariance function which de- 
fines the properties of this mapping. For example, an exponentiated quadratic covariance function 

k (xi, Xj) = (a se ) 2 exp ^ forces the latent functions to be infinitely smooth. We de- 

note any covariance function hyperparameters (such as (a 3e ,l) of the aforementioned covariance 
function) by 6. The collection of latent function instantiations, denoted by F = {f n }^> is normally 
distributed, allowing us to compute analytically the marginal likelihood Q 

f N 

P(X\X)= / n^(yn|fnMfn|x n )dF = Ar(r|0,K 7V7 v + a e 2 I), K NN = k(X,X). (2) 



Gaussian processes have also been used with success in unsupervised learning scenarios, where the 
input data X are not directly observed. The Gaussian process latent variable model (GP-LVM) 
[Lawrence 2005 2004[ provides an elegant solution to this problem by treating the unobserved 



inputs X as latent variables, while employing a product of D independent GPs as prior for the latent 
mapping. The assumed generative procedure now takes the form: 

Vnd = /d(x n ) + Cnd, (3) 

where e is again random Gaussian noise and F = {f^}£ =1 with f nd = /d(x n ). Given a finite 
dataset, the Gaussian process priors take the form of a Gaussian distribution 

D 

p(F\X) = Y[AT(f d \0,K NN ) (4) 

d=l 

which, in turn, allows for general non-linear mappings to be marginalised out analytically to obtain 
the likelihood p(Y \ X) = fldLi N{yd\X), analogously to equation ([2]). 



1 All probabilities involving / should also have in the conditioning set, but here we omit it for clarity. 
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2.2 Deep Gaussian Processes 



Our deep Gaussian proce ss architecture corresponds to a graphical model with three kinds of nodes, 

illustrated in figure [MM the leaf nodes Y e7Z NxD which are the only observed ones, the interme- 

L [y * at \ / . — w — ~w — . . — w — — n — > . 



diate latent spaces Xh G 7Z Nx ® h ,h = 1, ...,H — 1, where is the number of hidden layers, and 
the parent latent node Z = X# G 7Z Nx ® z . In our notation the parent node is separated from the 
rest of the hidden layers for notational clarity and because it can be constrained with a prior of our 
choice and, therefore, be involved in different mathematical expressions. In this deep architecture, 
all intermediate nodes act as inputs for the layer below (including the leaves) and as outputs for 
the layer above. For simplicity, consider a structure with only two hidden units, as the one depicted 



in figure | l|[b)| The generative process takes the form: 

ynd=fd(*n)+ e nd> d=l,...,D, X n G 7lP 

x nq =f x (z n ) + e£, <z=l,...,Q, z n eU Qz (5) 

and the intermediate node is involved in two Gaussian processes, f Y and f x , playing the role of an 
input and an output respectively: f Y ~ gv(0, k Y (X, X)) and f x ~ GV(0, k x (Z, Z)). This 
structure can be naturally extended vertically (i.e. deeper hierarchies) or horizontally (i.e. segmen- 
tation of each layer into different partitions of the output space), as we will see later in the paper. 
However, it is already obvious how each layer adds a significant number of model parameters (Xh) 
as well as a regularization challenge, since the size of each latent layer is crucial but has to be a priori 
defined. For this reason, unlike [Lawrence and Moore| |2007], we seek to variationally marginalise 
out the whole latent space. Not only this will allow us to obtain an automatic Occam's razor due to 
the Bayesian training, but also we will end up with a significantly lower number of model parame- 
ters, since the variational procedure only adds variational parameters. The first step to this approach, 
is to define automatic relevance determination (ARD) covariance functions for the GPs: 

fc(xi, Xj -) =aZ rd e-*Z?=i w *( Xi >*- x >>* )2 . (6) 

This covariance function assumes a different weight w q for each latent dimension and this can be 
exploited in a Bayesian training framework in order to "switch off" irrelevant dimensions by driv- 
ing their corresp onding weight to zero, t hus helping towards automatically finding the structure of 
complex models [Damianou et al. , 2012 ]. However, the nonlinearities introduced by this covariance 



function make the Bayesian treatment of this model challenging. Nevertheless, following recent 
non-standard variational inference methods we can define analytically an approximate Bayesian 
training procedure, as will be explained in the next section. 

2.3 Bayesian Training 

A Bayesian training procedure requires optimisation of the model evidence: 

logp(y) = log J P (Y\X)p(X\Z)p(Z)dXdZ. (7) 

When prior information is available regarding the observed data (e.g. their dynamical nature is 
known a priori), the prior distribution on the parent latent node can be selected so as to constrain the 
whole latent space through propagation of the prior density through the cascade. Here we take the 
general case where p(Z) = Af(Z\0, 1). However, the integral of equation ^ is intractable due to 
the nonlinear way in which X and Z are treated through the GP priors f Y and f x . As a first step, 
we apply Jensen's inequality to find a variational lower bound on log p(Y): 

lo 6P (Y) >T» = j Slog ^MWMF'IM fflaF y m 

where we expanded the likelihood and, additionally, we introduced a variational distribution Q, the 
form of which will be defined later on. The above integral is still intractable because X and Z 



still appear nonlinea rly in the p(F Y \X) and p(F x \Z) terms respectively. A key result of Titsias i 
and Lawrence| |2010| is that expanding the probability space of the GP prior p(F\X) with extra 



variables allows for priors on the latent space to be propagated through the nonlinear mapping /. 
More precisely, we augment the probability space of equation ([4} with K auxiliary pseudo-inputs 
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X G 7Z KxQ and Z G 7Z KxQz that correspond to a collection of function values U Y G 7Z KxD and 
U x G 7Z Kx Q respectively^] Following this approach, we obtain the augmented GP likelihood: 

p(Y\X,X) = p(Y\F Y )p(F Y \U Y ,X)p(U Y \X)dF Y dU Y (9) 

and similarly for the p(X\Z) term. The pseudo-inputs X and Z are known as inducing points, and 
will be dropped from our expressions from now on, for clarity. Note that F Y and U Y are draws 
from the same GP so that p(U Y ) andp(F y \U Y , X) are also Gaussian distributions. 

We are now able to define a variational distribution Q which, when combined with the new expres- 
sions for the augmented GP priors, results in a tractable variational bound. Specifically, we have: 

Q=p(F Y \U Y ,X)q(U Y )q(X)p(F x \U x ,Z) q (U x )q(Z). (10) 

We select q(U Y ) and q(U x ) to be free-form variational distributions, while q(X) and q(Z) are 
chosen to be Gaussian, factorised with respect to dimensions: 

Q Qz 
q(X) = l[M(tf,S?) and q{Z) = J] AT(/xf , Sf). (11) 

q=l 9=1 

By substituting equations ( fT0| ) and ^ back to ([5]), we see that the "difficult" terms p(F Y \U Y , X) 
and p(F x \U X , Z) cancel out in the fraction, leaving a quantity that can be computed analytically: 

T v = / Clog P(y^HU Y )p(X\F x , U X )P(U X ) P (Z) dXdZdF y dFXdU y dU ^ {n) 

More specifically, we can break the logarithm in equation §Y2\ by grouping the variables of the 
fraction in such a way that the bound can be written as: 

J 7 v=Sr + T X + n q(x) - KL [q(Z) || p(Z)} (13) 

where H represents the entropy with respect to a distribution, KL denotes the Kullback - Leibler 
divergence and, using (•) to denote expectations, 



g Y = g(Y, F Y 1 U Y 1 X)={ \ogp(Y\F Y ) + log ) (14) 



<l(U Y ) I p ( F Y\uY ,X)q(U Y )q(X) 



r x = r(X, F X ,U X ,Z)=( \ogp(X\F x ) + log ^Kv ) • (15) 



q(U x ) 



p{FX\UX,Z)q{UX)q{X)q{Z) 



Both terms only involve known Gaussian densi ties and are, thus, tractable. The gy term is the same 
as the bound found for the Bayesian GP-LVM [Titsi as and Law rence, 2010]. Since it it only involves 
expectations with respect to Gaussian distributions, the GP output variables are only involved in a 
quantity of the form YY T . Further, as can be seen from the above equations, the function r(-) is 
similar to g(-) but it involves expectations with respect to densities of all of its variables. Therefore, 
rx will involve X (i.e. the outputs of the top layer) in a term (IX 7 )^^^ = Y^=i Vq^q + Sq- 



3 Extending the hierarchy 

Although the main calculations were demonstrated in a simple hierarchy, it is easy to extend the 
model vertically, i.e. by adding more hidden layers, or horizontally, i.e. by considering conditional 
independencies of the latent variables belonging to the same layer. The first case only requires 
adding more rj functions to the variational bound, i.e. instead of a single term we will now 
have the a sum: ^f^ 1 r Xh , where r Xh = r(X h , F Xh , U Xh , X h +i), with X H = Z. 

Now consider the horizontal expansion scenario and assume that we wish to break the single la- 
tent space Xh, of layer h, to conditionally independent subsets. As long as the variational 
distribution q(Xh) of equation (pj} is chosen to be factorised in a consistent way, this is feasi- 
ble by just breaking the original rx h term of equation ( [T5] ) into the sum J2m=i r x^- ^ ms 
lows just from the fact that, due to the independence assumption, it holds that logp(X^|X^ + i) = 

2 The number of inducing points K does not need to be the same for all GPs 



4 



Y^m=i l^+i ) • Notice that the same principle can also be applied to the leaves by break- 

ing the gy term of the bound. This scenario arises when, for example we are presented with multiple 
different observation spaces which, however, we believe they have some commonality. For example, 
when the observed data are coming from a video and an audio recording of the same event. Given 
the above, the variational bound for the most general version of the model takes the form: 

M Y H-l M h H-l 

+ E E rg } + J2 U q{Xh) - KL (q(Z) \\ p(Z)) . (16) 

m=l h—1 m=l h—1 



Figure 1 c) shows the association of this objective function's terms with each layer of the hierarchy. 



Recall that each and term is associated with a different GP and, thus, is coming with its 
own set of automatic relevance determination (ARD) weights, which were described in equation ([6]). 



3.1 Deep multiple-output Gaussian processes 

The particular way of extending the hierarchies horizontally, as presented above, can be seen as 
a means of performing unsupervised multiple-output GP learning. This only requires assigning 
a different gy term (and, thus, associated ARD weights) to each vector y^, where d indexes the 
output dimensions. After training our model, we hope that the columns of Y that encode similar 
information will be assigned relevance weight vectors that are also similar. This idea can be extended 
to all levels of the hierarchy, thus obtaining a fully factorised deep GP model. 

This special case of our model makes the connection between our model's structure and neural 
network architectures more obvious: the ARD parameters play a role similar to the weights of neural 
networks, while the latent variables play the role of neurons which learn hierarchies of features. 
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X2) — K X\ 
layer 2 layer 1 
(a) 
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S-^H-l s-^Mh (m) 
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^JZ^^ layer H 
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\ / \t 
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y(i) . . . y(Afy) leaves 
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Figure 1 : Different graphical represent ation s of the Deep GP model: (a) shows the general architec- 
ture with a cascade of H hidden layers, [(b)] depicts a simplification of a two hidden layer hierarchy 
also demonstrating the corresponding GP mappings and [(c)] illustrates the most general case where 
the leaves and all intermediate nodes are allowed to form conditionally independent groups. The 
terms of the objective ( fT6] ) corresponding to each layer are included on the left of each level. 



3.2 Parameters and complexity 

In all graphical model variants shown in figure [TJ every arrow represents a generative procedure with 
a GP prior, corresponding to a set of parameters {X, 6, a e }. Further, each layer of latent variables 
corresponds to a variational distribution q(X) which is associated with a set of variational means 
and covariances, as shown in equation $TT\ . The parent node can have the same form as equation 
( pT| ) or can be constrained with a more informative prior which would couple the points of q(Z). 
For example, a dynamical prior would introduce Q x N 2 parameters which can, nevertheless, be 
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reparametrized using less variables [Damianou et al.||2011| . However, as is evident from equations 
<[10]) and ( ^2| ), the inducing points and the parameters of q{X) and q(Z) are variational rather than 
model parameters, something which significantly helps in regularizing the problem. Therefore, 
adding more layers to the hierarchy does not introduce many more model parameters. Moreover, as 
in common sparse methods for Gaussian processes [ Titsias , 2009], the complexity of each generative 
GP mapping is reduced from the typical 0(N 3 ) to 0(NM^J. 



4 Demonstration 

In this section we demonstrate the deep GP model in toy and real- world datasets. The model is 
initialised by performing dimensionality reduction in the observations to obtain the first hidden 
layer and then repeating this process for the next layers. To obtain these stacked initial spaces we 
experimented with PC A and the Bayesian GP-LVM, but the end result did not vary significantly. 

4.1 Toy data 

We first test our model in toy data, created by sampling from a three-level hierarchy of GPs. Figure[2] 
(a) depicts the true hierarchy: from the top latent layer two intermediate latent signals are generated. 
These, in turn, together generate 10— dimensional observations (not depicted here) through sampling 
of another GP. These observations are then used to train the following models: a deep GP, a simple 
stacked Isomap and a simple stacked PCA method, the results of which are shown in figures [2] (b, c, 
d) respectively. Note that only our method marginalises the latent space and, in contrast to the other 
two, it is not given any information about the dimensionality of each true signal in the hierarchy. 
Further, as can be seen in figure [2j our model not only finds the correct dimensionality for each 
hidden layer, but it also discovers latent signals which are closer to the real ones. 




(a) (b) (c) (d) 



Figure 2: Attempts to reconstruct the real data (fig. (a)) with our model (b), stacked Isomap (c) and 
stacked PCA (d). Our model can also find the correct dimensionalities automatically. 



4.2 Modeling human motion 



For our second dem onstration we consider the same motion capture dataset used from |Lawrence 
and Moore [2007], taken from the CMU MOCAP databasa^l The data, summarised in two 



62— dimensional sets of 78 points each, represent two subjects walking towards each other and 
performing a 'high-five'. We applied our method with a two-level hierarchy where the two observa- 
tion sets were taken to be conditionally independent given their parent latent layer. Therefore, we 
obtained three optimised sets of ARD parameters: one for each modality of the bottom layer (shown 
with bar graphs having bins of different colours/ widths for each modality, in figure |^b)| and one 
ARD weight set for the top node, as shown in figure ^ £) Our model discovered a common subspace 
in the intermediate layer, since for dimensions 2 and 6 both ARD sets have a non-zero value. This 
is expected, as the two subjects perform very similar motions with opposite directions. The ARD 
weights are also a means of automatically selecting the dimen sionality of each layer and sub space. 
This kind of modelling is impossible for a MAP method like [Lawre nce and Moore| [2007] which 
requires the exact latent structure to be given a priori. The full latent space learned by the afore- 
mentioned MAP method is plotted in figure [4] (d,e,f), where fig. (d) corresponds to the top latent 
space and each of the other two encodes information for each of the two interacting subjects. Our 
method is not constrained to two dimensional spaces, so for comparison we plot two-dimensional 



3 http://mocap.cs.cmu.edu. 
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projections of the dominant dimension s of each subspace in figure |4| (a,b,c). The similarity of the 
latent spaces is obvious. In contrast to Lawre nce and M oore [2007 ], we did not have to constrain 
the latent space with dynamics in order to obtain results of good quality. 

Further, we can sample from these spaces to see what kind of information they encode. Indeed, we 
observed that the top layer generates outputs which correspond to different variations of the whole 
sequence, while when sampling from the first layer we obtain outputs which only differ in a small 
subset of the output dimensions, e.g. those corresponding to the subject's hand. 




■L 



(b) 



(c) 



Figure 3: Figure (a) shows the deep GP model employed. Figure (b) shows the ARD weights for 
f Yl and f Y2 and figure (c) those for f x . 



v" x J 

(a) (b) (c) 



(d) 



(e) 



(f) 



Figure 4: Left (a,b,c): projections of the latent spaces discovered by our model, Right (d,e,f): the 
full latent space learned for the model of |Lawrence and M oore [ 2007 ]. 



4.3 Deep learning of digit images 

In our last experiment we demonstrate the ability of our model to learn latent features of increas- 
ing abstraction and we demonstrate the usefulness of an analytic bound on the model evidence as a 
means of evaluating the quality of the data fit for different choices of the overall depth of the hier- 
archy. We built a dataset consisting of 50 examples for each of the digits 0,1 and 6 taken from the 
USPS handwritten digit database. Each digit is represented as an image in 16 x 16 pixels. We exper- 
imented with models of depth ranging from 1 (equivalent to Bayesian GP-LVM) to 5 hidden layers 
and evaluated each model by measuring the nearest neighbour error in the latent features discovered 
in each hierarchy. Our finding was that the approximate evidence of the model was increasingwith 
the number of layers and so was the quality of the model in terms of nearest neighbour errors [J In- 
deed, the single-layer model made 5 mistakes even though it automatically decided to use 10 latent 
dimensions and the quality of the trained models was increasing with the number of hidden layers. 
Finally, only one point had a nearest neighbour of a different class in the 4— dimensional top le vel's 
feature space of a model with depth 5. A 2D projection of this space is plotted in figure [^fa)] The 
ARD weights for this model are depicted in figure [5] 

To further test the nature of the discovered lat ent fe atures, we generated outputs by sampling from 
each hidden layer. As can be seen in figure (J b) the first hierarchical layers encode very local 
features whereas the higher ones encode much more abstract information. 



5 Discussion and future work 

We have introduced a framework for efficient Bayesian training of hierarchical Gaussian process 
mappings for unsupervised learning. Our approach approximately marginalises out the latent space, 



The results were the same when we took into account the Bayesian Information Criterion. 
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Figure 6: Fig. (a) demonstrates the nearest neighbour class separation test on a deep GP model 



with depth 5. Fig. |(b)| shows outputs obtained when sampling from this model. The first two rows 
(top-down), which were sampled from layers 1 and 2 respectively, encode very local features, e.g. 
explaining if a zero is a closed circle or not, or how big the circle of a 6 is. We discovered many more 
local features when we sampled from different dimensions. On the other hand, when we sampled 
from the two dominant dimensions of the parent latent node (two rows in the bottom) we obtained 
much more varying outputs. Thus, the higher levels indeed encode much more abstract information. 



thus allowing for automatic structure discovery in the hierarchy. The method was able to success- 
fully learn a hierarchy of features which describe natural human motion and the pixels of handwritten 
digits. Our variational lower bound selected a deep hierarchical representation for handwritten digits 
even though the data in our experiment was relatively scarce (150 data points). 

In the future, we wish to test our model on various inference tasks, such as class conditional density 
estimation. Our method can also be used to improve existing deep algorithms, something which we 
plan to further investigate by incorporating ideas from past approaches. Indeed, previous efforts to 
combine G Ps with deep structu res were successful at unsupervised pre-training [Er han et al. 201Q[ 
or guiding [ Snoek et al. 2012) of traditional deep models. 



Although the experiments presented here considered only up to five layers in the hierarchy, the 
methodology is directly applicable to deeper architectures, with which we intend to experiment in 
the future. The marginalisation of the latent space allows for such an expansion with simultaneous 
regularisation. The variational lower bound allows us to make a principled choice between models 
trained using different initializations and with different numbers of layers. 

The deep hierarchy we have proposed could also be used with inputs governing the top layer of the 
hierarchy, leading to a powerful model for regression based on Gaussian processes, but which is 
not itself a Gaussian process. We expect such a model to have applications in multitask learning 
(where intermediate layers could learn representations shared across the tasks) and in modelling 
nonstationary data or data involving jumps. These are both areas where a single layer GP struggles. 
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